MDEV-38975: HEAP engine BLOB/TEXT/JSON/GEOMETRY column support #4735
arcivanov wants to merge 29 commits into MariaDB:10.11 from
Conversation
@arcivanov, a truly impressive first PR. See https://mariadb.com/docs/server/server-management/install-and-upgrade-mariadb/installing-mariadb/compiling-mariadb-from-source/compile-and-using-mariadb-with-sanitizers-asan-ubsan-tsan-msan#buildbots-msan-container for help resolving the MSAN errors. Full CI results: https://buildbot.mariadb.org/#/grid?branch=refs%2Fpull%2F4735%2Fhead Do reach out if you need help: https://mariadb.zulipchat.com
Force-pushed from ced75a6 to 57fc52e
gkodinov left a comment
Thank you for your contribution! This is a preliminary review.
There's nothing formally wrong with the diff. One remark: this seems like a new feature to me. So, technically, it should be going to main according to policy.
But I'll leave that decision to the final reviewer.
LGTM. Please stay tuned for the final review.
gkodinov left a comment
Missed the tests failing. Please make sure buildbot doesn't produce any failures.
The tests are being fixed right now.
Force-pushed from 08ec2cf to 7fbaf3e
gkodinov left a comment
Nothing more to add. Please stay tuned for the final review.
I have added a lot of comments about the design in https://jira.mariadb.org/browse/MDEV-38975
Force-pushed from f1dc279 to 15023ad
The failing test in https://buildbot.mariadb.org/#builders/534/builds/35108 is a flaky timeout issue. It is rpl.rpl_semi_sync_shutdown_await_ack, a replication semi-sync test, nothing to do with HEAP/BLOB: a timeout waiting for a semi-sync ACK during shutdown, failing across all three binlog formats (stmt, row, mix). This is a flaky replication test, not caused by this feature. Not being in MariaDB, I can't press rebuild to clear it, so would anybody be so kind?
Never mind, I pushed a commit update; hopefully the next run is luckier.
Those tests do sound like the usual suspects. Don't worry, review isn't going to stop because of a few tests, and we'll retrigger if important. MSAN was happy after your first update and still is. Thanks.
This PR is finalized pending review. I don't expect any more architectural changes or refactorings unless requested. |
Allow BLOB/TEXT/JSON/GEOMETRY columns in MEMORY (HEAP) engine tables by storing blob data in variable-length continuation record chains within the existing `HP_BLOCK` structure.

**Continuation runs**: blob data is split across contiguous sequences of `recbuffer`-sized records. Each run stores a 10-byte header (`next_cont` pointer + `run_rec_count`) in the first record; inner records (rec 1..N-1) have no flags byte — full `recbuffer` payload. Runs are linked via `next_cont` pointers. Individual runs are capped at 65,535 records (`uint16` format limit); larger blobs are automatically split into multiple runs.

**Zero-copy reads**: single-run blobs return pointers directly into `HP_BLOCK` records, avoiding `blob_buff` reassembly entirely:

- Case A (`run_rec_count == 1`): return `chain + HP_CONT_HEADER_SIZE`
- Case B (`HP_ROW_CONT_ZEROCOPY` flag): return `chain + recbuffer`
- Case C (multi-run): walk chain, reassemble into `blob_buff`

`HP_INFO::has_zerocopy_blobs` tracks zero-copy state; used by `heap_update()` to refresh the caller's record buffer after freeing old chains, preventing dangling pointers.

**Free list scavenging**: on insert, the free list is walked read-only (peek) tracking contiguous groups in descending address order (LIFO). Qualifying groups (>= `min_run_records`) are unlinked and used. The first non-qualifying group terminates the scan — remaining data is allocated from the block tail. The free list is never disturbed when no qualifying group is found.

**Record counting**: new `HP_SHARE::total_records` tracks all physical records (primary + continuation). `HP_SHARE::records` remains logical (primary-only) to preserve linear hash bucket mapping correctness.

**Scan/check batch-skip**: `heap_scan()` and `heap_check_heap()` read `run_rec_count` from rec 0 and skip entire continuation runs at once.

**Hash functions**: `hp_rec_hashnr()`, `hp_rec_key_cmp()`, `hp_key_cmp()`, `hp_make_key()` updated to handle `HA_BLOB_PART` key segments — reading actual blob data via pointer dereference or chain materialization.

**SQL layer**: `choose_engine()` no longer rejects HEAP for blob tables (replaced `blob_fields` check with `reclength > HA_MAX_REC_LENGTH`). `remove_duplicates()` routes HEAP+blob to `remove_dup_with_compare()`. `ha_heap::remember_rnd_pos()` / `restart_rnd_next()` implemented for DISTINCT deduplication support. Fixed undefined behavior in `test_if_cheaper_ordering()` where `select_limit/fanout` could overflow to infinity — capped at `HA_POS_ERROR`.

https://jira.mariadb.org/browse/MDEV-38975
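For illustration only (not the PR's code), a minimal sketch of the 10-byte run header described above: an 8-byte `next_cont` pointer followed by a 2-byte `run_rec_count`. The `HP_CONT_HEADER_SIZE` value and the exact field order are assumptions.

```c
#include <stdint.h>
#include <string.h>

#define HP_CONT_HEADER_SIZE 10          /* assumed: 8-byte pointer + 2-byte count */

typedef struct run_header {
  unsigned char *next_cont;             /* first record of the next run, or NULL */
  uint16_t       run_rec_count;         /* records in this run (<= 65535) */
} run_header;

/* Decode the header stored at the start of a run's first record. */
static void read_run_header(const unsigned char *rec0, run_header *h)
{
  memcpy(&h->next_cont, rec0, sizeof(h->next_cont));
  memcpy(&h->run_rec_count, rec0 + sizeof(h->next_cont), sizeof(h->run_rec_count));
}

/* Encode it back; HEAP is an in-memory engine, so raw pointer bytes are fine. */
static void write_run_header(unsigned char *rec0, const run_header *h)
{
  memcpy(rec0, &h->next_cont, sizeof(h->next_cont));
  memcpy(rec0 + sizeof(h->next_cont), &h->run_rec_count, sizeof(h->run_rec_count));
}
```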
…bles group_concat needs special code in store() for storing the result in table->blob_storage. This was implemented in Field_blob::store() but not in Field_blob_compressed::store(). It was temporarily worked around by not using Field_blob_compressed when table->s->is_optimizer_tmp_table() would be set. This, however, disabled Field_blob_compressed for temporary tables that did not need a key for the blob field. Fixed by ensuring that Field_blob::store() and Field_blob_compressed::store() handle group_concat identically. As this handling is only done for internal temporary tables, we store the data uncompressed for faster usage by group_concat().
The problem was that create_internal_tmp_table() tries to create a normal key for the blob, which does not work. The fix is to force a unique key if BLOB keys were used for the original HEAP table.
Force-pushed from 5dde250 to 39eda4e
Apply Monty's review feedback across HEAP blob implementation:
**`hp_update.c`:**
- Hoist `HP_BLOB_DESC *desc` to block scope, use `desc++` in all three
blob loops instead of re-indexing `&share->blob_descs[i]` each iteration
- Move `new_len` declaration to block scope, remove inner `{ }` wrapper
block, dedent the write-new-chains loop body by one level
- Replace `if (blob_changed[i]) any_changed= TRUE` with branchless
`any_changed|= blob_changed[i]`
- Add braces to rollback `for (j= 0; j < i; j++)` loop body
- Update chain pointer restoration comment to explain `pos` vs `old`
pointer semantics for segmented blobs
- Rename inner loop `new_len` to `cur_len` to avoid shadowing block-scope
`new_len`
**`read_lowendian()` move (F33/F63):**
- Move `read_lowendian()` from `sql/field.h` to `include/my_base.h` so
pure-C storage engines can use it
- Convert `hp_blob_length()` from standalone function in `hp_blob.c` to
`static inline` wrapper in `heapdef.h` calling `read_lowendian()`
- Convert `hp_blob_key_length()` in `hp_hash.c` to `static inline`
wrapper calling `read_lowendian()`
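For illustration, a minimal sketch of the static-inline wrapper pattern described in the items above; the real `read_lowendian()` in `include/my_base.h` and the exact `hp_blob_length()` arguments may differ.

```c
#include <stdint.h>

/* Illustrative stand-in for read_lowendian(): assemble an integer from
   `bytes` low-endian length bytes. */
static inline uint64_t read_lowendian(const unsigned char *from, unsigned bytes)
{
  uint64_t v= 0;
  unsigned i;
  for (i= 0; i < bytes; i++)
    v|= (uint64_t) from[i] << (8 * i);
  return v;
}

/* hp_blob_length() as a thin static-inline wrapper: `pack_length` is the
   number of length bytes the blob column stores (1..4). */
static inline uint32_t hp_blob_length(const unsigned char *pos, unsigned pack_length)
{
  return (uint32_t) read_lowendian(pos, pack_length);
}
```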
**Test renames** (Monty's naming convention — drop `heap_` prefix):
- `heap_blob.test` → `blob.test`
- `heap_blob_big{1,2,3}.test` → `blob_big{1,2,3}.test`
- `heap_blob_big.inc` → `blob_big.inc`
- `heap_blob_groupby.test` → `blob_group_by.test`
- `heap_blob_ops.test` → `blob_ops.test`
**Other files:** Style fixes from earlier feedback items applied across
`hp_blob.c`, `hp_write.c`, `hp_hash.c`, `hp_scan.c`, `hp_delete.c`,
`ha_heap.cc`, `heapdef.h`, `_check.c`, `heap.h`, `field.h`,
`sql_select.cc`, and test files.
# Conflicts:
#   sql/mysqld.cc
#   sql/sql_parse.cc
`hp_alloc_from_tail()` now takes `uint *blocks` (in/out) and allocates a contiguous batch of records from the current leaf block in one call, replacing the per-record inner loop in `hp_write_one_blob()` Step 2. The caller pre-computes the record count needed for the chosen storage format (Case B for `is_only_run`, Case C otherwise), and the function returns however many are available up to the request. The flat if/else-if/else then selects Case A, B, or C based on the actual count. This eliminates the record-by-record extension loop, both contiguity guards with `abort()`, and the Case B extra-record allocation logic, reducing Step 2 from ~170 lines to ~60.
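A caller-side sketch of the in/out batch contract described above. `alloc_batch_from_tail()` below is a hypothetical stand-in that simulates `hp_alloc_from_tail()` with a static buffer; the real prototype and parameter naming may differ.

```c
#include <stddef.h>

#define RECBUFFER   128                 /* illustrative record size */
#define BLOCK_RECS  64                  /* illustrative leaf-block capacity */

static unsigned char leaf_block[BLOCK_RECS * RECBUFFER];
static unsigned      tail_used;         /* records already handed out */

/* On entry *count is the number of contiguous records wanted; on return it is
   how many were actually available in this "block" (<= the request).
   The blob writer asks for the whole run in one call and then picks Case A,
   B, or C from the actual count (flat if / else if / else), instead of
   extending the run record by record. */
static unsigned char *alloc_batch_from_tail(unsigned *count)
{
  unsigned avail= BLOCK_RECS - tail_used;
  unsigned char *first;
  if (avail == 0)
    return NULL;
  if (*count > avail)
    *count= avail;                      /* partial batch: caller falls back to a linked run */
  first= leaf_block + (size_t) tail_used * RECBUFFER;
  tail_used+= *count;
  return first;
}
```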
Two fixes in `hp_write_one_blob()`:

**Bug fix**: Step 1 free-list contiguity detection failed to update `prev_pos` inside the contiguity branch, so the check `pos == prev_pos - recbuffer` could only detect 2-record groups. The third record was always compared against the original `prev_pos` (2 recbuffers away), causing a false discontinuity. Fix: add `prev_pos = pos` after `run_start = pos`.

**Deficiency MariaDB#2**: When Step 2 (tail allocation) fails with `HA_ERR_RECORD_FILE_FULL` and there are still deleted records on the free list, a new Step 3 walks the entire free list accepting any contiguous group (even single slots). Each group is written as a Case C run via `hp_unlink_and_write_run()`. This produces maximally fragmented chains, which are slower to read but correct. Failing with table-full when free slots exist is worse than a fragmented chain.

Tests:
- `hp_test_freelist-t.c`: 38 unit tests covering contiguity detection (prev_pos bug guard), repeated delete-reinsert cycles, Step 3 scavenge fallback, and true capacity exhaustion
- `heap/blob_fallback.test`: MTR test exercising the fallback at SQL level with fragmented free list
- Extracted shared `hp_test_helpers.h` from duplicate code in `hp_test_scan-t.c` and `hp_test_freelist-t.c`
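A standalone sketch of the contiguity scan and the `prev_pos` fix described above; the free-list node layout and names are illustrative.

```c
#include <stddef.h>

struct free_rec { struct free_rec *next_deleted; };

/* Length of the contiguous group starting at the free-list head, walking in
   descending address order (LIFO free list). */
static unsigned contiguous_group_len(struct free_rec *head, size_t recbuffer)
{
  unsigned len= 1;
  unsigned char *prev_pos= (unsigned char *) head;
  struct free_rec *next= head->next_deleted;

  while (next && (unsigned char *) next == prev_pos - recbuffer)
  {
    len++;
    prev_pos= (unsigned char *) next;   /* the missing prev_pos update; without it
                                           only 2-record groups were ever detected */
    next= next->next_deleted;
  }
  return len;
}
```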
`SHOW STATUS LIKE 'Created_tmp%'` counts include the extra temp table created by prepared statement re-execution under `--ps-protocol`. The test already disabled `cursor_protocol` and `ps2_protocol` but missed `ps_protocol`, causing `blob_big1`/`blob_big2`/`blob_big3` to fail on CI builders that run with `--ps-protocol`.
`heap_prepare_hp_create_info()` mishandled blob key segments whose
field type is `Field_blob` or `Field_geom` (as opposed to
`Field_blob_key`).
`Field_blob::key_type()` returns `HA_KEYTYPE_VARBINARY2` (geometry)
or `HA_KEYTYPE_VARTEXT2` (text blobs). Commit `30415846402`
("Introduce Field_blob_key") added logic that stripped `HA_BLOB_PART`
from these segments, assuming they use "VARCHAR packing." This is
wrong: DISTINCT/UNION key fields are `Field_blob` (not
`Field_blob_key`), and their record format is a blob descriptor
(`packlength` bytes of length + 8-byte data pointer), not a varchar
(2-byte length prefix + inline data).
After `HA_BLOB_PART` was stripped, `hp_create.c` normalized
`VARBINARY2` → `VARTEXT1`. The hash function `hp_rec_hashnr()` then
entered the `VARTEXT1` branch, which reads the first 2 bytes as a
varchar length and hashes that many bytes starting at offset 2. For a
geometry blob descriptor, the first 2 bytes are the low bytes of the
WKB data length (e.g. ~100 for a simple polygon), so the hash read
~100 bytes starting inside the 12-byte descriptor — overshooting into
adjacent fields or uninitialized record buffer memory.
This caused MSAN "use-of-uninitialized-value" crashes in
`innodb_gis.1`, `innodb_gis.point_basic`, `main.gis`, and
`innodb_gis.gis` on the `amd64-msan-clang-20` CI builder. Beyond the
MSAN crash, it was also a functional bug: hashing the raw pointer
bytes meant two rows with identical geometry data but different memory
addresses would hash differently, breaking UNION DISTINCT
deduplication.
**Fix**: when `HA_BLOB_PART` is set and the key type is
`VARBINARY2`/`VARTEXT2`, promote to `VARBINARY4`/`VARTEXT4` instead
of stripping the flag. Set `bit_start` to the actual `packlength`
and `length` to `4 + sizeof(pointer)`. `hp_create.c` then normalizes
to `VARTEXT4` and the hash/compare functions use the blob path:
dereference the pointer and operate on the actual data.
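A simplified sketch of the promotion rule described above. `heap_prepare_hp_create_info()` actually operates on `HA_KEYSEG`; the types and flag value here are illustrative stand-ins.

```c
#include <stddef.h>

enum keytype_sketch { KT_VARTEXT2, KT_VARBINARY2, KT_VARTEXT4, KT_VARBINARY4, KT_OTHER };
#define BLOB_PART_FLAG 1u               /* stand-in for HA_BLOB_PART */

struct keyseg_sketch {
  enum keytype_sketch type;
  unsigned flag;
  unsigned bit_start;                   /* blob length-byte count (packlength) */
  unsigned length;                      /* key segment length in the record */
};

static void promote_blob_keyseg(struct keyseg_sketch *seg, unsigned packlength)
{
  if (!(seg->flag & BLOB_PART_FLAG))
    return;                                     /* not a blob key part */
  if (seg->type == KT_VARTEXT2)
    seg->type= KT_VARTEXT4;                     /* keep the blob flag, widen the type */
  else if (seg->type == KT_VARBINARY2)
    seg->type= KT_VARBINARY4;
  else
    return;
  seg->bit_start= packlength;                   /* actual number of length bytes */
  seg->length= 4 + (unsigned) sizeof(void *);   /* 4-byte length + data pointer */
}
```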
3-phase `heap/blob_stress` MTR test exercising free-list fragmentation, continuation chain reuse, and data integrity under sustained mixed DML:

- Phase 1: 200-cycle stored procedure — 5 inserts, 2 deletes, 3 updates per cycle with blob sizes cycling through Case A/B/C; shadow table row count verification with `SIGNAL` on mismatch
- Phase 2: near-capacity (2 MB) fill/fragment/refill with free-list scavenging, then full-delete reinsert
- Phase 3: 20 grow/shrink `UPDATE` cycles (even rows 10-18 KB, odd rows 5-20 bytes)

All phases verify blob content integrity via single-character `REPEAT` pattern check. Addresses Monty's F14 feedback.
# Conflicts:
#   mysql-test/main/metadata.result
`Field_geom::store()`: assert `blob_storage` is not set, catching any future removal of the MDEV-16699 `group_concat` downgrade in `Field_blob::make_new_field()`. `heap_prepare_hp_create_info()`: after `HA_BLOB_PART` promotion, assert key type is `VARTEXT4`/`VARBINARY4` and `bit_start` is 1-4, catching blob key segments that were not promoted from `VARTEXT2`.
Force-pushed from d61b485 to 0d6a069
- `Field_blob::get_key_image_itRAW` and `Field_blob::key_cmp`: cast `local_char_length` to `uint32` in `set_if_smaller()` — safe because `charpos()` is bounded by `blob_length` which is already `uint32`.
- `hp_test_hash-t.c`: widen `blob_len` parameter from `uint16` to `size_t` in `build_record()` and `build_mixed_record()` to match `LEX_CUSTRING::length` type.
Force-pushed from 47b989d to 6e5e5d3
When `ha_update_tmp_row()` fails with `HA_ERR_RECORD_FILE_FULL` on a HEAP temp table (e.g., `MAX(TEXT)` aggregate growing the blob during GROUP BY accumulation), convert the table to Aria and retry the update — matching the existing INSERT overflow handling.

**Mechanism**: `create_internal_tmp_table_from_heap()` copies all rows; `record[0]` write is rejected as duplicate (same GROUP BY key). `get_dup_key()` populates `dup_ref`, then `ha_rnd_pos()` locates the old row in Aria for the update.

**Two call sites fixed**:
- `end_update()`: switches to `end_unique_update` after conversion
- `end_unique_update()`: restores INDEX mode via `rnd_inited` flag

`GROUP_CONCAT` does NOT trigger this path — its `update_field()` is `DBUG_ASSERT(0)` (it accumulates internally). `MAX(TEXT)` / `MIN(TEXT)` are the aggregates that write growing blobs via `result_field->store()`.
When `hp_write_one_blob()` fails partway through tail allocation (e.g. `HA_ERR_RECORD_FILE_FULL`), `hp_free_run_chain()` puts the partial chain onto the delete list. These records were just tail-allocated, but once on the delete list they could only be reused via free-list scavenging, not tail allocation, so `last_allocated` only grew forward.

Add `hp_shrink_tail()` which pops tail-positioned records from the delete list head and decrements `last_allocated`. Crosses block boundaries by locating the previous leaf block via `hp_find_block()` and updating `last_blocks`. Empty blocks stay allocated in the tree (freed at table drop).

Add `high_water_allocated` to `HP_BLOCK` to track the peak `last_allocated` before shrinking. In `hp_alloc_from_tail()`, when `block_pos == 0` and `last_allocated < high_water_allocated`, the next leaf block already exists in the tree: reuse it via `hp_find_block()` instead of calling `hp_get_new_block()`, avoiding memory waste from duplicate block allocations.

Unit tests (41 new assertions in `hp_test_freelist-t.c`):
- Single-block tail reclaim after failed blob insert
- Cross-block reclaim (2 blocks, 1 boundary crossing)
- 3-block reclaim (2 boundary crossings, `last_blocks` restoration)
- Orphaned block reuse: non-blob and blob inserts fill reclaimed blocks without growing `data_length`
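A minimal sketch of the tail-shrinking idea from this commit: while the delete-list head is exactly the last tail-allocated record, pop it and give the slot back to tail allocation. Block-boundary handling (`hp_find_block()`, `last_blocks`) is omitted and the names are illustrative.

```c
#include <stddef.h>

struct deleted_rec { struct deleted_rec *next_deleted; };

struct tail_state {
  unsigned char      *tail_end;         /* one past the last tail-allocated record */
  size_t              recbuffer;        /* record size */
  unsigned long       last_allocated;   /* physical records allocated so far */
  unsigned long       deleted;          /* records sitting on the delete list */
  struct deleted_rec *del_link;         /* head of the delete list */
};

static void shrink_tail_sketch(struct tail_state *s)
{
  while (s->del_link &&
         (unsigned char *) s->del_link == s->tail_end - s->recbuffer)
  {
    s->tail_end-= s->recbuffer;                 /* hand the slot back to the tail */
    s->last_allocated--;
    s->deleted--;
    s->del_link= s->del_link->next_deleted;     /* pop the delete-list head */
  }
}
```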
For blob tables, `find_unique_row()` previously materialized blobs twice: once during `hp_rec_key_cmp()` (per-segment via `hp_materialize_one_blob()`) and again via `hp_read_blobs()` after the match was found. Reorder the blob path to materialize-then-compare: save the input record, copy the stored candidate, call `hp_read_blobs()` once to materialize all blobs, then compare via `hp_rec_key_cmp()` with `info=NULL` since both records now have direct data pointers. Non-blob tables keep the original fast path unchanged.
Remove `enum hp_blob_format` and `hp_blob_run_format()` indirection. Add `HP_ROW_MULTIPLE_REC` (bit 5) so all three blob storage formats have a dedicated flag bit. Add named inline predicates `hp_is_single_rec()`, `hp_is_zerocopy()`, `hp_is_multi_run()` matching the existing `hp_is_active()`/`hp_has_cont()`/`hp_is_cont()` pattern. Change `hp_write_run_data()` format parameter from enum to `uchar` receiving bit constants directly; simplify flags byte assignment from ternary to bitwise OR. Addresses review feedback F127-F128, F130, F132-F134.
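A sketch of the named-predicate pattern this commit introduces. Only `HP_ROW_MULTIPLE_REC` (bit 5) and `HP_ROW_CONT_ZEROCOPY` are names from the PR; the other bit value and the decode of the single-record case are illustrative assumptions.

```c
#include <stdbool.h>

#define HP_ROW_CONT_ZEROCOPY (1u << 4)  /* assumed bit position */
#define HP_ROW_MULTIPLE_REC  (1u << 5)  /* bit 5, per the commit message */

/* Named predicates over the row flags byte, mirroring the existing
   hp_is_active()/hp_has_cont()/hp_is_cont() style. */
static inline bool hp_is_zerocopy(unsigned char flags)
{ return (flags & HP_ROW_CONT_ZEROCOPY) != 0; }

static inline bool hp_is_multi_run(unsigned char flags)
{ return (flags & HP_ROW_MULTIPLE_REC) != 0; }

static inline bool hp_is_single_rec(unsigned char flags)
{ return !(flags & (HP_ROW_CONT_ZEROCOPY | HP_ROW_MULTIPLE_REC)); }
```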
Allow BLOB/TEXT/JSON/GEOMETRY columns in MEMORY (HEAP) engine tables by storing blob data in variable-length continuation record chains within the existing `HP_BLOCK` structure.

Additionally, promote VARCHAR fields whose octet_length > 32 bytes to BLOB when the temp table uses the HEAP engine, storing only actual data in continuation chains instead of reserving the full declared width in every fixed-width row. This is the same promotion mechanism that `type_handler_for_tmp_table()` applies at `CONVERT_IF_BIGGER_TO_BLOB` (512 characters), just at a lower byte-based threshold for HEAP.

**HASH index key handling for BLOB columns**: the HEAP hash functions (`hp_rec_hashnr()`, `hp_rec_key_cmp()`, `hp_key_cmp()`, `hp_make_key()`, `hp_hashnr()`) are extended to handle `HA_BLOB_PART` key segments, reading actual blob data via pointer dereference or chain materialization. This enables `GROUP BY`, `DISTINCT`, and `UNION` deduplication to work correctly on BLOB/TEXT columns entirely within the MEMORY engine. A hash pre-check optimization skips expensive blob materialization when hash values differ, and PAD SPACE collation semantics are preserved for blob key comparisons.
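A sketch of the blob-aware hashing path described above: for an `HA_BLOB_PART` segment the record holds low-endian length bytes followed by a data pointer, so the hash must dereference and hash the pointed-to bytes. The hash function and descriptor layout here are illustrative, and a little-endian host is assumed for brevity.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative byte hash (FNV-1a style), standing in for HEAP's hash. */
static uint32_t hash_bytes(const unsigned char *p, size_t n, uint32_t h)
{
  while (n--)
    h= (h ^ *p++) * 16777619u;
  return h;
}

/* Hash one blob key segment from a record: read the length (packlength
   low-endian bytes), read the data pointer that follows, hash the data. */
static uint32_t hash_blob_segment(const unsigned char *rec, unsigned seg_offset,
                                  unsigned packlength, uint32_t h)
{
  const unsigned char *descr= rec + seg_offset;
  uint32_t length= 0;
  unsigned char *data;

  memcpy(&length, descr, packlength);                /* little-endian host assumed */
  memcpy(&data, descr + packlength, sizeof(data));   /* pointer to the actual bytes */
  return hash_bytes(data, length, h);
}
```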
**Continuation runs**: blob data is split across contiguous sequences of `recbuffer`-sized records. Multi-record runs store a 10-byte header (`next_cont` pointer + `run_rec_count`) in the first record; inner records (rec 1..N-1) have no flags byte — full `recbuffer` payload. Runs are linked via `next_cont` pointers. Individual runs are capped at 65,535 records (`uint16` format limit); larger blobs are automatically split into multiple runs.
**Blob storage format** (`enum hp_blob_format`, decoded by `hp_blob_run_format()` from the flags byte — single detection mechanism used by all write, read, free, scan, and check paths):

- Case A (`HP_BLOB_CASE_A_SINGLE_REC`): blob fits in one record. No run header — data starts at offset 0, full `visible` bytes available. Saves 10 bytes per small blob. Zero-copy: `chain`.
- Case B (`HP_BLOB_CASE_B_ZEROCOPY`): single run, multiple records. Header in rec 0, data contiguous in rec 1..N-1. Zero-copy: `chain + recbuffer`.
- Case C (`HP_BLOB_CASE_C_MULTI_RUN`): one or more runs linked via `next_cont`. Header + data in each run. Requires reassembly into `blob_buff`.

`HP_INFO::has_zerocopy_blobs` tracks zero-copy state; used by `heap_update()` to refresh the caller's record buffer after freeing old chains, preventing dangling pointers. Under HEAP's current table-level locking (`thr_lock`), other handles are blocked during `UPDATE`, so cross-handle dangling pointers cannot occur. For internal temporary tables (single handle, single thread), this is definitively safe.
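A sketch of the three read paths listed above (`chain` points at the first record of the blob's first run). Flags-byte handling and the actual Case C chain walk are elided, and the names are illustrative.

```c
#include <stddef.h>

typedef enum { CASE_A_SINGLE_REC, CASE_B_ZEROCOPY, CASE_C_MULTI_RUN } blob_case;

/* Return a pointer to the blob bytes for the caller's record buffer. */
static const unsigned char *blob_data_ptr(blob_case c,
                                          const unsigned char *chain,
                                          size_t recbuffer,
                                          const unsigned char *blob_buff)
{
  switch (c)
  {
  case CASE_A_SINGLE_REC:
    return chain;                 /* no run header: data at the start of the record */
  case CASE_B_ZEROCOPY:
    return chain + recbuffer;     /* header in rec 0, data starts at rec 1 */
  case CASE_C_MULTI_RUN:
  default:
    return blob_buff;             /* multi-run: data reassembled into blob_buff */
  }
}
```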
**Unchanged blob optimization**: `heap_update()` compares each blob column's old and new values before rewriting chains. Detection order: length (O(1)), data pointer (O(1)), `memcmp` (O(n) with early exit). Unchanged blobs keep existing chains with no allocation, copy, or free. Only changed blobs get new chains written.
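The change-detection order above as a small sketch (length, pointer identity, then `memcmp`); types and names are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

static bool blob_changed(const unsigned char *old_data, uint32_t old_len,
                         const unsigned char *new_data, uint32_t new_len)
{
  if (old_len != new_len)
    return true;                       /* O(1): lengths differ */
  if (old_data == new_data)
    return false;                      /* O(1): same pointer, nothing to compare */
  return memcmp(old_data, new_data, old_len) != 0;   /* O(n) with early exit */
}
```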
**Free list scavenging**: on insert, the free list is walked read-only (peek) tracking contiguous groups in descending address order (LIFO). Qualifying groups (>= `min_run_records`) are unlinked and used. The first non-qualifying group terminates the scan — remaining data is allocated from the block tail. The free list is never disturbed when no qualifying group is found.
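The accept/stop policy above, sketched over a pre-discovered list of contiguous-group sizes (the real code peeks the free list itself and unlinks groups in place); names are illustrative.

```c
static unsigned scavenge_records(const unsigned *group_len, unsigned n_groups,
                                 unsigned min_run_records)
{
  unsigned taken= 0;
  unsigned i;
  for (i= 0; i < n_groups; i++)
  {
    if (group_len[i] < min_run_records)
      break;                    /* first non-qualifying group ends the scan;
                                   the free list is left untouched beyond it */
    taken+= group_len[i];       /* unlink this group and use it for a run */
  }
  return taken;                 /* the remainder is allocated from the block tail */
}
```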
**Record counting**: new `HP_SHARE::total_records` tracks all physical records (primary + continuation). `HP_SHARE::records` remains logical (primary-only) to preserve linear hash bucket mapping correctness.

**Scan/check batch-skip**: `heap_scan()` and `heap_check_heap()` use `hp_blob_run_format()` to detect the storage format. Case A skips one record; Case B/C read `run_rec_count` from the header and skip the entire run at once.
**Blob key format bridge**: `Field_blob::new_key_field()` returns `Field_varstring` (2B length + inline data), but HEAP expects `hp_make_key` format (4B length + data pointer). Precomputed `HP_KEYDEF::has_blob_seg` flag triggers key rebuild from `record[0]` via `hp_make_key()` in `ha_heap::index_read_map()` and related methods. GROUP BY key setup sets `key_part_flag` from `field->key_part_flag()` (not hardcoded 0) so `HA_BLOB_PART` reaches HEAP's blob code paths. For promoted blob columns where `Field_blob::key_length()` returns 0, the GROUP BY key field uses the item's `max_length` instead, matching the buffer size allocated by `calc_group_buffer()`.

**SQL layer**: `choose_engine()` no longer rejects HEAP for blob tables (replaced `blob_fields` check with `reclength > HA_MAX_REC_LENGTH`). `remove_duplicates()` routes HEAP+blob to `remove_dup_with_compare()`. `ha_heap::remember_rnd_pos()` / `restart_rnd_next()` implemented for DISTINCT deduplication support. Fixed undefined behavior in `test_if_cheaper_ordering()` where `select_limit/fanout` could overflow to infinity — capped at `HA_POS_ERROR`.

**VARCHAR→BLOB promotion**: extracted `pick_engine()` from `choose_engine()`, called early in `start()` to set `m_heap_expected`. VARCHAR fields with `octet_length > HEAP_CONVERT_IF_BIGGER_TO_BLOB` (32 bytes) are promoted to BLOB via `Tmp_field_param::varchar_to_blob_threshold`. For I_S tables, the same promotion is applied in `add_schema_fields()` using the field definition's octet length.